Ryan Schaefer
import numpy as np
from matplotlib import pyplot as plt
import os
import cv2
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import copy
from skimage.feature import daisy, match_descriptors
import pandas as pd
This data set contains training and testing sets of x-ray images, where some of the images are of a patient with pneumonia and the rest do not have pneumonia (labeled as normal). The training data contains 5,232 images (1,349 normal and 3,883 with pneumonia) and the testing data contains 624 images (234 normal and 390 with pneumonia). For this lab, only the training data will be used.
Pneumonia is a serious issue in children's health, as "pneumonia kills about 2 million children under 5 years old every year and is consistently estimated as the single leading cause of childhood mortality" (1). The goal of this data set is to screen children for pneumonia prior to visiting a doctor. Doctors (and everyone working in their offices) who see child patients who may have pneumonia will be interested in the results of a prediction algorithm, as an accurate algorithm can help the doctor make a diagnosis and filter the number of patients the doctor needs to see. If doctors are going to filter the patients they see based on a prediction algorithm, then the false negative rate needs to be low. If the result is a false positive, the doctor will see the patient and, on further inspection, determine that they do not have pneumonia, so no harm is done. However, if the result is a false negative, the doctor may not see a patient who needs treatment for pneumonia. Thus, the overall prediction accuracy and false positive rate are not as important as the false negative rate, which should be as low as possible. A useful prediction algorithm should detect pneumonia in a patient at least 95% of the time (a false negative rate below 5%).
Link to data set: https://www.kaggle.com/datasets/tolgadincer/labeled-chest-xray-images
Other references:
%%time
# Directory containing images data
data_dir = "../../Datasets/chest_xray/train"
# Subdirectories with normal and pneumonia images
categories = ["NORMAL", "PNEUMONIA"]
# Size to resize images to
size = (224, 224)
labels = []
images = []
for cat in categories:
    # For each label, create path to subdirectory
    path = os.path.join(data_dir, cat)
    for file in os.listdir(path):
        # Read and resize all images in subdirectory
        labels.append(cat)
        img = cv2.imread(os.path.join(path, file), cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, size)
        images.append(img)
CPU times: user 14 s, sys: 1.52 s, total: 15.6 s Wall time: 17.5 s
# Convert 3D array of image pixels into 2D array with one row per image
sample_size = len(images)
images = np.array(images).flatten().reshape(sample_size, size[0] * size[1])
n_samples, n_features = images.shape
h, w = size
classes = list(set(labels))
n_classes = len(classes)
print(np.sum(~np.isfinite(images)))
print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))
print("classes: {}".format(classes))
print("Image Sizes {} by {}".format(h,w))
print (h * w) # the size of the images are the size of the feature vectors
0 n_samples: 5232 n_features: 50176 n_classes: 2 classes: ['PNEUMONIA', 'NORMAL'] Image Sizes 224 by 224 50176
# a helper plotting function
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of images"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
The images below are a selection of 18 x-rays of patients without pneumonia.
# Get the index of the first pneumonia image and only display normal images
pneum_index = labels.index("PNEUMONIA")
plot_gallery(images[:pneum_index], labels[:pneum_index], h, w) # defaults to showing a 3 by 6 subset of the faces
The images below are a selection of 18 x-rays of patients with pneumonia.
# Only display pneumonia images
plot_gallery(images[pneum_index:], labels[pneum_index:], h, w) # defaults to showing a 3 by 6 subset of the faces
Principal Component Analysis (PCA) is a dimensionality reduction technique that attempts to represent as much of the original data as possible with a minimal number of features. Transforming a data set with PCA significantly reduces the computing power required to analyze the data while retaining as much of the original variance as possible. Image classification can be done with PCA by finding the labeled, PCA-transformed image most similar to the image being classified. The full PCA of a data set is computed by finding the eigenvectors of the covariance matrix and keeping the eigenvectors with the largest eigenvalues (the principal components).
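As a concrete illustration of that eigendecomposition view, here is a minimal sketch on synthetic data (the array and its dimensions are made up for illustration); it mirrors what `sklearn.decomposition.PCA` achieves internally via the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 samples, 5 features
Xc = X - X.mean(axis=0)                 # center the data first

# Eigendecomposition of the covariance matrix
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: ascending order for symmetric matrices
order = np.argsort(eigvals)[::-1]       # sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                   # keep only the top-k components
X_reduced = Xc @ eigvecs[:, :k]         # project onto the leading eigenvectors

# The eigenvalues give the explained-variance ratio of each component
print(eigvals[:k] / eigvals.sum())
```

The ratio printed at the end is the same quantity `explained_variance_ratio_` reports in the cells below.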
To perform PCA on these images, we need to determine the number of components we need that will explain the majority of the variance in the images. As we reduce the number of components, the computing power required for analysis decreases, but our ability to accurately represent the original data also decreases.
%%time
pca = PCA().fit(images)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.show()
CPU times: user 26min 32s, sys: 6min 34s, total: 33min 6s Wall time: 3min 10s
The plot above shows the proportion of the variance that can be explained by a given number of components. The explained variance appears to begin flattening out at approximately 1,000 components. We can make another plot that only considers component counts less than 1,000 to determine the optimal number of components.
# function for producing graph of explained variance and number of components
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode()
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [
            Bar(y=explained_var, name='individual explained variance'),
            Scatter(y=cum_var_exp, name='cumulative explained variance')
        ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
%%time
# Set the number of components to consider
n_components = 1000
# Fit the PCA model on the data
pca = PCA(n_components = n_components, svd_solver = 'full')
pca.fit(images)
# Compute the cumulative explained variance ratio
cumulative_var = np.cumsum(pca.explained_variance_ratio_)
# Set the desired amount of variance to capture
desired_var = 0.95
# Find the number of components needed to capture the desired amount of variance
n_components_needed = np.argmax(cumulative_var >= desired_var) + 1
try:
    assert True in (cumulative_var >= desired_var)
    print("Number of components needed to capture", desired_var, "of the variance:", n_components_needed)
    plot_explained_variance(pca)
except AssertionError:
    print("Desired variance not met")
Number of components needed to capture 0.95 of the variance: 645
CPU times: user 29min 18s, sys: 6min 4s, total: 35min 22s Wall time: 3min 22s
The plot above shows the proportion of variance that is explained by a given number of components using Full PCA. To capture 95% of the variance, we need 645 components, which we will use to compute the PCA for our images.
%%time
n_components = n_components_needed
print ("Extracting the top %d eigenfaces from %d faces" % (n_components, images.shape[0]))
pca = PCA(n_components = n_components, svd_solver = 'full',)
pca.fit(images.copy())
eigenfaces = pca.components_.reshape((n_components, h, w))
Extracting the top 645 eigenfaces from 5232 faces CPU times: user 22min 59s, sys: 3min 59s, total: 26min 58s Wall time: 2min 29s
plot_gallery(eigenfaces, labels, h, w)
The images above show the leading principal components ("eigenfaces") found by Full PCA, displayed as images. They are blurrier than the original x-rays, but the general structure of the chest is still visible, which suggests the components retain enough of the original data to make predictions.
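To see what a compressed image itself looks like (as opposed to the components), one could project into the reduced space and map back with `inverse_transform`. A minimal sketch on synthetic data (a small random matrix stands in for the real flattened image matrix from the earlier cells):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the flattened image matrix (rows = "images")
rng = np.random.default_rng(7324)
X = rng.normal(size=(50, 64))  # e.g. 50 images of 8x8 pixels

pca = PCA(n_components=10, svd_solver='full').fit(X)

# Project into the 10-dimensional PCA space, then map back to pixel space
X_recon = pca.inverse_transform(pca.transform(X))

# The round trip is lossy; the error shrinks as n_components grows
err = np.linalg.norm(X - X_recon) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.3f}")
```

Reshaping a row of `X_recon` back to image dimensions would give the blurred, compressed version of that x-ray.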
The problem with Full PCA is that it can be slow to compute with a large number of features or observations. An alternative is Randomized PCA, which uses a lower-rank approximation of the covariance matrix. This is much faster to compute, but may not represent the data as accurately as Full PCA.
%%time
# Set the number of components to consider
n_components = 1000
# Fit the PCA model on the data
rpca = PCA(n_components = n_components, svd_solver = 'randomized')
rpca.fit(images)
# Compute the cumulative explained variance ratio
cumulative_var = np.cumsum(rpca.explained_variance_ratio_)
# Set the desired amount of variance to capture
desired_var = 0.95
# Find the number of components needed to capture the desired amount of variance
n_components_needed = np.argmax(cumulative_var >= desired_var) + 1
try:
    assert True in (cumulative_var >= desired_var)
    print("Number of components needed to capture", desired_var, "of the variance:", n_components_needed)
    plot_explained_variance(rpca)
except AssertionError:
    print("Desired variance not met")
Number of components needed to capture 0.95 of the variance: 645
CPU times: user 5min 8s, sys: 33.7 s, total: 5min 42s Wall time: 32.1 s
The plot above shows the proportion of variance that is explained by a given number of components using Randomized PCA. To capture 95% of the variance, we need 645 components (the same as Full PCA), which we will use to compute the Randomized PCA for our images.
%%time
n_components = n_components_needed
print ("Extracting the top %d eigenfaces from %d faces" % (n_components, images.shape[0]))
rpca = PCA(n_components = n_components, svd_solver = 'randomized')
rpca.fit(images.copy())
eigenfaces = rpca.components_.reshape((n_components, h, w))
Extracting the top 645 eigenfaces from 5232 faces CPU times: user 3min 35s, sys: 23.1 s, total: 3min 58s Wall time: 21.8 s
The Randomized PCA for this data set was computed in 21.8 seconds. This is much faster than the Full PCA, which was computed in 2 minutes and 29 seconds.
plot_gallery(eigenfaces, labels, h, w)
The images above show the leading principal components found by Randomized PCA. They are visually very similar to the Full PCA components, but may differ when analyzed computationally.
# Modified version of Dr Larson's code to compare Full PCA to Randomized PCA
# Instead of comparing Full PCA to DAISY
# init a classifier for each feature space
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_rpca = KNeighborsClassifier(n_neighbors=1)
# transform images with PCA/RPCA
pca_features = pca.transform(copy.deepcopy(images))
rpca_features = rpca.transform(copy.deepcopy(images))
# separate the data into train/test
pca_train, pca_test, rpca_train, rpca_test, labels_train, labels_test = train_test_split(
pca_features, rpca_features, labels, test_size=0.2, train_size=0.8, random_state = 7324)
# fit each classifier
knn_pca.fit(pca_train,labels_train)
acc_pca = accuracy_score(knn_pca.predict(pca_test),labels_test)
knn_rpca.fit(rpca_train,labels_train)
acc_rpca = accuracy_score(knn_rpca.predict(rpca_test),labels_test)
# report accuracy
print(f"Full PCA accuracy:{100*acc_pca:.2f}%, Randomized PCA Accuracy:{100*acc_rpca:.2f}%")
Full PCA accuracy:91.98%, Randomized PCA Accuracy:92.07%
The Full PCA and Randomized PCA transformations of the data were split into training and testing sets (80% training and 20% testing) to determine the accuracy of predicting if a patient has pneumonia or not based on the Full PCA and Randomized PCA transformations of an x-ray image. To make a prediction, a K Nearest Neighbors (KNN) classifier with k of 1 was fit using the training data for each algorithm. Each image in the testing data was then classified using the trained classifier. The label of the most similar image in the training data is assigned to each image in the testing data.
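The 1-NN rule amounts to a nearest-distance lookup. A minimal sketch on made-up 2-D points (the coordinates and labels are purely illustrative): the prediction is just the label of the closest training point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two "normal" points and one "pneumonia" point
train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_labels = np.array(["NORMAL", "NORMAL", "PNEUMONIA"])
test = np.array([[0.9, 1.2], [4.0, 4.5]])

# k = 1: classify by the single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1).fit(train, train_labels)
pred = knn.predict(test)

# Equivalent by hand: label of the closest training point (Euclidean distance)
dists = np.linalg.norm(train[None, :, :] - test[:, None, :], axis=2)
manual = train_labels[np.argmin(dists, axis=1)]
print(pred, manual)  # both: ['NORMAL' 'PNEUMONIA']
```

In the actual experiment above, each point is a 645-dimensional PCA feature vector rather than a 2-D coordinate, but the lookup is the same.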
The overall accuracy is 91.98% for the Full PCA classifier and 92.07% for the Randomized PCA classifier. These accuracy measures are similar enough that the prediction accuracy of the two algorithms is effectively the same. However, we stated in the business understanding that we care more about the false negative rate than the overall accuracy.
# Get the accuracy by class for each algorithm
correct = np.zeros(4)
pca_guess = knn_pca.predict(pca_test)
rpca_guess = knn_rpca.predict(rpca_test)
for i in range(len(labels_test)):
    if labels_test[i] == pca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[0] += 1
        else:
            correct[1] += 1
    if labels_test[i] == rpca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[2] += 1
        else:
            correct[3] += 1
totals = pd.Series.value_counts(labels_test)
print("Full PCA:")
print(f"Normal Accuracy: {(100*correct[0]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[1]/totals['PNEUMONIA']):0.2f}%")
print()
print("Randomized PCA:")
print(f"Normal Accuracy: {(100*correct[2]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[3]/totals['PNEUMONIA']):0.2f}%")
Full PCA: Normal Accuracy: 79.62% Pneumonia Accuracy: 96.16% Randomized PCA: Normal Accuracy: 80.00% Pneumonia Accuracy: 96.16%
The percentages above show the percent accuracy of Full and Randomized PCA separated by label. The normal accuracy measures the percentage of the time the algorithm classified a patient as not having pneumonia when they do not have pneumonia. The pneumonia accuracy measures the percentage of the time the algorithm classified a patient as having pneumonia when they do have pneumonia. The false negative rate (predicting no pneumonia when the patient has pneumonia) can be calculated as 100% - the pneumonia accuracy.
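The same per-class breakdown can be read directly from a confusion matrix; a sketch with made-up labels (not the real test split):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels, illustrative only
y_true = np.array(["PNEUMONIA"] * 8 + ["NORMAL"] * 4)
y_pred = np.array(["PNEUMONIA"] * 7 + ["NORMAL"] +   # one pneumonia case missed
                  ["NORMAL"] * 3 + ["PNEUMONIA"])    # one normal case flagged

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=["NORMAL", "PNEUMONIA"])
tn, fp, fn, tp = cm.ravel()

print(f"false negative rate: {100 * fn / (fn + tp):.2f}%")
print(f"pneumonia accuracy (sensitivity): {100 * tp / (tp + fn):.2f}%")
```

Here the false negative rate is 1/8 = 12.50%, i.e. 100% minus the 87.50% pneumonia accuracy, matching the relationship described above.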
The Randomized PCA classifier was slightly more accurate than the Full PCA classifier at identifying patients without pneumonia (80% vs 79.62%), and the classifiers had identical accuracy for patients with pneumonia (96.16%). Thus, the false negative rate for these classifiers is 3.84%.
Given that the Randomized PCA was significantly faster to compute than the Full PCA and the accuracies are approximately the same, the Randomized PCA is the preferred method of predicting if a patient has pneumonia from x-ray images.
Another method of image classification is feature extraction, which can be done with DAISY. DAISY samples pixels on a grid separated by a predefined distance (the step). For each sampled pixel, DAISY places a predefined number of circles in one or more rings around the pixel and stores a histogram of the average gradient in each direction for each circle. These histograms are concatenated together to create a feature vector at that pixel location.
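A quick sketch of the descriptor grid DAISY produces (a random image stands in for a real x-ray; the parameters match the `apply_daisy` helper below):

```python
import numpy as np
from skimage.feature import daisy

# Illustrative synthetic "image" of the same size as the resized x-rays
img = np.random.default_rng(0).random((224, 224))

feat = daisy(img, step=10, radius=20, rings=2, histograms=8, orientations=4)

# Output is P x Q x R: one descriptor of length R at each sampled pixel, where
# R = (rings * histograms + 1) * orientations = (2*8 + 1) * 4 = 68
print(feat.shape)
```

Flattening this P x Q x R grid into one long vector per image gives the feature vectors used for KNN below.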
# create a function to take in the row of the matrix and return a new feature
def apply_daisy(row, shape):
    feat = daisy(row.reshape(shape), step=10, radius=20,
                 rings=2, histograms=8, orientations=4,
                 visualize=False)
    return feat.reshape((-1))
%%time
# Calculate DAISY features
daisy_features = np.apply_along_axis(apply_daisy, 1, images, (h,w))
CPU times: user 1min 51s, sys: 10.3 s, total: 2min 1s Wall time: 1min 59s
Computing DAISY feature vectors is slower than Randomized PCA (1 minute 59 seconds vs 21.8 seconds), but faster than Full PCA (1 minute 59 seconds vs 2 minutes 29 seconds).
# init a classifier for each feature space
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_rpca = KNeighborsClassifier(n_neighbors=1)
knn_dsy = KNeighborsClassifier(n_neighbors=1)
# separate the data into train/test
pca_train, pca_test, rpca_train, rpca_test, dsy_train, dsy_test, labels_train, labels_test = train_test_split(
pca_features, rpca_features, daisy_features, labels, test_size=0.2, train_size=0.8, random_state = 7324)
# fit each classifier
knn_pca.fit(pca_train,labels_train)
acc_pca = accuracy_score(knn_pca.predict(pca_test),labels_test)
knn_rpca.fit(rpca_train,labels_train)
acc_rpca = accuracy_score(knn_rpca.predict(rpca_test),labels_test)
knn_dsy.fit(dsy_train,labels_train)
acc_dsy = accuracy_score(knn_dsy.predict(dsy_test),labels_test)
# report accuracy
print(f"Full PCA accuracy: {100*acc_pca:.2f}%")
print(f"Randomized PCA accuracy: {100*acc_rpca:.2f}%")
print(f"DAISY accuracy: {100*acc_dsy:.2f}%")
Full PCA accuracy: 91.98% Randomized PCA accuracy: 92.07% DAISY accuracy: 94.27%
Predictions from the DAISY features are made with a KNN classifier, just as for Full PCA and Randomized PCA. The overall accuracy of DAISY is higher than the accuracies of Full and Randomized PCA (94.27% vs 91.98% and 92.07%), but the false negative rate is the better evaluation metric for this prediction task.
# Get the accuracy by class for each algorithm
correct = np.zeros(6)
pca_guess = knn_pca.predict(pca_test)
rpca_guess = knn_rpca.predict(rpca_test)
dsy_guess = knn_dsy.predict(dsy_test)
for i in range(len(labels_test)):
    if labels_test[i] == pca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[0] += 1
        else:
            correct[1] += 1
    if labels_test[i] == rpca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[2] += 1
        else:
            correct[3] += 1
    if labels_test[i] == dsy_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[4] += 1
        else:
            correct[5] += 1
totals = pd.Series.value_counts(labels_test)
print("Full PCA:")
print(f"Normal Accuracy: {(100*correct[0]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[1]/totals['PNEUMONIA']):0.2f}%")
print()
print("Randomized PCA:")
print(f"Normal Accuracy: {(100*correct[2]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[3]/totals['PNEUMONIA']):0.2f}%")
print()
print("DAISY:")
print(f"Normal Accuracy: {(100*correct[4]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[5]/totals['PNEUMONIA']):0.2f}%")
Full PCA: Normal Accuracy: 79.62% Pneumonia Accuracy: 96.16% Randomized PCA: Normal Accuracy: 80.00% Pneumonia Accuracy: 96.16% DAISY: Normal Accuracy: 95.85% Pneumonia Accuracy: 93.73%
Although DAISY's overall accuracy is higher than the accuracy of Full and Randomized PCA, this is because DAISY is much better at identifying x-rays without pneumonia than the PCA algorithms (95.85% vs 79.62% and 80%). DAISY is actually worse at identifying pneumonia than the PCA algorithms (93.73% vs 96.16%). Thus, Randomized PCA appears to be the better choice for this data.
An alternative DAISY prediction method to feature extraction is key point matching, which compares the number of matching key points between images rather than comparing whole feature vectors. For each testing image, matches are found with every training image, and the training image with the largest match count (or percentage) is used to classify the testing image. The obvious problem is that a brute-force algorithm comparing all pairs of images is computationally expensive, but does it provide enough additional prediction accuracy to offset this expense? To ensure that this code runs in a reasonable amount of time, only about 20% of the images (roughly 1,000) will be used (80% for training and 20% for testing).
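Before the full implementation, a minimal sketch of what `skimage.feature.match_descriptors` returns (random descriptors stand in for real DAISY output, and one exact match is planted so at least one pair survives the filters):

```python
import numpy as np
from skimage.feature import match_descriptors

rng = np.random.default_rng(0)
d1 = rng.random((30, 68))   # descriptors from a "test" image
d2 = rng.random((40, 68))   # descriptors from a "training" image
d2[5] = d1[3]               # plant one exact match

# Each row of `matches` is a pair (index into d1, index into d2);
# cross_check keeps only mutually-nearest pairs, max_ratio applies a ratio test
matches = match_descriptors(d1, d2, cross_check=True, max_ratio=0.8)
print(matches.shape[0], "matches")
```

The match count `matches.shape[0]` is the similarity score used to pick the best training image for each test image.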
def daisy_key_point(images_train, images_test, labels_train):
    def apply_daisy(row, shape):  # keeps the PxQ grid of descriptors rather than flattening to one vector
        feat = daisy(row.reshape(shape), step=5, radius=5,
                     rings=2, histograms=8, orientations=4,
                     visualize=False)
        s = feat.shape  # PxQxR
        # P = ceil((Height - radius*2) / step)
        # Q = ceil((Width - radius*2) / step)
        # R = (rings * histograms + 1) * orientations
        return feat.reshape((s[0]*s[1], s[2]))

    labels_guess = []
    for img in images_test:
        d1 = apply_daisy(img, (h, w))
        best_matches = -1
        best_index = -1
        # Brute force: match the test image's descriptors against every training image
        for i, img2 in enumerate(images_train):
            d2 = apply_daisy(img2, (h, w))
            matches = match_descriptors(d1, d2, cross_check=True, max_ratio=0.8)
            if matches.shape[0] > best_matches:
                best_matches = matches.shape[0]
                best_index = i
        labels_guess.append(labels_train[best_index])
    return labels_guess
images_train, images_test, labels_train, labels_test = train_test_split(
images, labels, test_size=0.2, train_size=0.8, random_state = 7324)
images_train, images_test, labels_train, labels_test = train_test_split(
images_test, labels_test, test_size=0.2, train_size=0.8, random_state = 7324)
print(pd.Series(labels_train).value_counts())
print(pd.Series(labels_test).value_counts())
PNEUMONIA 623 NORMAL 214 dtype: int64 PNEUMONIA 159 NORMAL 51 dtype: int64
%%time
labels_guess = daisy_key_point(images_train, images_test, labels_train)
pd.Series(labels_guess).value_counts()
CPU times: user 3h 47min 56s, sys: 19min 29s, total: 4h 7min 25s Wall time: 4h 7min 41s
PNEUMONIA 143 NORMAL 67 dtype: int64
As previously mentioned, key point matching with DAISY is very slow! Even with a subset of only about 1,000 of the 5,232 images, key point matching still took over 4 hours to run!
correct = 0
for i in range(len(labels_test)):
    if labels_test[i] == labels_guess[i]:
        correct += 1
print(f"Accuracy: {100*correct/len(labels_test):0.2f}%")
Accuracy: 91.43%
The overall prediction accuracy of key point matching appears to be slightly lower than some of the other algorithms, but let's rerun the other algorithms on this subset of the data to make sure.
pca_train, pca_test, rpca_train, rpca_test, dsy_train, dsy_test = train_test_split(
pca_test, rpca_test, dsy_test, test_size=0.2, train_size=0.8, random_state = 7324)
# init a classifier for each feature space
knn_pca = KNeighborsClassifier(n_neighbors=1)
knn_rpca = KNeighborsClassifier(n_neighbors=1)
knn_dsy = KNeighborsClassifier(n_neighbors=1)
# fit each classifier
knn_pca.fit(pca_train,labels_train)
acc_pca = accuracy_score(knn_pca.predict(pca_test),labels_test)
knn_rpca.fit(rpca_train,labels_train)
acc_rpca = accuracy_score(knn_rpca.predict(rpca_test),labels_test)
knn_dsy.fit(dsy_train,labels_train)
acc_dsy = accuracy_score(knn_dsy.predict(dsy_test),labels_test)
# report accuracy
print(f"Full PCA accuracy: {100*acc_pca:.2f}%")
print(f"Randomized PCA accuracy: {100*acc_rpca:.2f}%")
print(f"DAISY accuracy: {100*acc_dsy:.2f}%")
print(f"DAISY Key Point Matching accuracy: {100*correct/len(labels_test):0.2f}%")
Full PCA accuracy: 89.52% Randomized PCA accuracy: 89.05% DAISY accuracy: 93.33% DAISY Key Point Matching accuracy: 91.43%
The overall prediction accuracy of each algorithm is slightly lower for this subset of the data than for the full data set. This makes sense, as more data makes it more likely to find a better matching image. Key point matching has a lower overall accuracy than DAISY (91.43% vs 93.33%), but a slightly higher accuracy than Full and Randomized PCA (91.43% vs 89.52% and 89.05%).
# Get the accuracy by class for each algorithm
correct = np.zeros(8)
pca_guess = knn_pca.predict(pca_test)
rpca_guess = knn_rpca.predict(rpca_test)
dsy_guess = knn_dsy.predict(dsy_test)
dsy_kp_guess = labels_guess
for i in range(len(labels_test)):
    if labels_test[i] == pca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[0] += 1
        else:
            correct[1] += 1
    if labels_test[i] == rpca_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[2] += 1
        else:
            correct[3] += 1
    if labels_test[i] == dsy_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[4] += 1
        else:
            correct[5] += 1
    if labels_test[i] == dsy_kp_guess[i]:
        if labels_test[i] == "NORMAL":
            correct[6] += 1
        else:
            correct[7] += 1
totals = pd.Series.value_counts(labels_test)
print("Full PCA:")
print(f"Normal Accuracy: {(100*correct[0]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[1]/totals['PNEUMONIA']):0.2f}%")
print()
print("Randomized PCA:")
print(f"Normal Accuracy: {(100*correct[2]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[3]/totals['PNEUMONIA']):0.2f}%")
print()
print("DAISY:")
print(f"Normal Accuracy: {(100*correct[4]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[5]/totals['PNEUMONIA']):0.2f}%")
print()
print("DAISY Key Point Matching:")
print(f"Normal Accuracy: {(100*correct[6]/totals['NORMAL']):0.2f}%")
print(f"Pneumonia Accuracy: {(100*correct[7]/totals['PNEUMONIA']):0.2f}%")
Full PCA: Normal Accuracy: 76.47% Pneumonia Accuracy: 93.71% Randomized PCA: Normal Accuracy: 74.51% Pneumonia Accuracy: 93.71% DAISY: Normal Accuracy: 94.12% Pneumonia Accuracy: 93.08% DAISY Key Point Matching: Normal Accuracy: 98.04% Pneumonia Accuracy: 89.31%
The performance of the first three algorithms relative to one another on this subset of the data is mostly the same as before, but Randomized PCA is now slightly less accurate than Full PCA on non-pneumonia patients (74.51% vs 76.47%), and the margin between DAISY and the PCA algorithms for pneumonia patients is narrower (93.08% vs 93.71% now, compared to 93.73% vs 96.16% on the full data set).
Key point matching is extremely accurate at identifying patients that do not have pneumonia (98.04% vs 76.47%, 74.51%, and 94.12%). However, key point matching is the least accurate algorithm tested for patients with pneumonia (89.31% vs 93.71% and 93.08%). This makes the false negative rate 10.69%, which is not good enough for this prediction task. Although key point matching may be more accurate with a larger data set, the run time is so long compared to the other algorithms and the false negative rate with 1,000 images is so much worse than the other algorithms that it does not appear to be a viable option for this prediction task.